Project 5 - Unsupervised Learning¶Problem Statement: Credit Card Customer Segmentation¶Background¶AllLife Bank wants to focus on its credit card customer base in the next financial year. The marketing research team has advised that market penetration can be improved. Based on this input, the Marketing team proposes to run personalised campaigns to target new customers as well as upsell to existing ones. Another insight from the market research was that customers perceive the bank's support services poorly. Based on this, the Operations team wants to upgrade the service delivery model to ensure that customers' queries are resolved faster. The Head of Marketing and the Head of Delivery both decide to reach out to the Data Science team for help.
Objective¶Steps and Tasks:¶Attribute Information:¶The data covers various customers of a bank, with their credit limit, the total number of credit cards each customer holds, and the different channels through which the customer has contacted the bank for queries; the channels include visiting the bank, online contact, and phone calls.
Input variables:¶Key Questions¶import warnings
warnings.filterwarnings('ignore')
import pandas as pd #Read files
import numpy as np # numerical libraries
# Import visualization libraries
import seaborn as sns
import matplotlib.pyplot as plt
#%matplotlib inline
# Import libraries to work on K-means
from sklearn.cluster import KMeans
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import MinMaxScaler
from sklearn import metrics
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn import tree
from IPython.display import Image
from os import system
pd.options.display.float_format = '{:,.2f}'.format
# Below we will read the data from the local folder
df = pd.read_excel('Credit Card Customer Data.xlsx')
# Now display the header
print ('Credit Card Customer Data.xlsx data set:')
df.head(5)
# "Sl_No" looks like the index column of the data, so before dropping it I will set it as the index.
df = pd.read_excel('Credit Card Customer Data.xlsx', index_col= 'Sl_No')
# And now will display the header again
print ('Credit Card Customer Data.xlsx data set:')
df.head(10)
df.tail() ## to see what the end of the data looks like
df.info() # here we will see the number of entries (rows and columns), the dtypes, and the non-null counts
print(f"The given dataset contains {df.shape[0]} rows and {df.shape[1]} columns")
print(f"The given dataset contains {df.isna().sum().sum()} null values")
neg_exp = df[df.lt(0)]  # mask showing where negative values are present
print("The number of negative entries is", sum(n < 0 for n in df.values.flatten()))
# The output might be taken into consideration later on in the calculations.
df.shape # size of the data set (# rows or entries, # columns or variables)
df.describe().transpose() # Transpose is used here to make the attributes easier to read
df.nunique() # Number of unique values in each column
# This helps to identify categorical values and gives an idea of possible groups or clusters.
# Since "Customer Key" has almost as many unique values as the total number of inputs (660), probably some customers have more than one entry.
# This means that either the entries are duplicated, or perhaps they come from different periods of time or bank branches.
# I will try to confirm it with the next line.
df['Customer Key'].value_counts()
repeated_cust = (47437, 37252, 97935, 96929, 50706)  ## list of duplicated customer keys taken from the result above
df.loc[df['Customer Key'].isin(repeated_cust)]  # This command shows the duplicated data
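Rather than hard-coding the repeated keys, they can be pulled out programmatically. A minimal sketch, using a hypothetical mini-frame in place of the real data set:

```python
import pandas as pd

# Hypothetical mini-frame standing in for the real data set
df = pd.DataFrame({
    'Customer Key': [47437, 12345, 47437, 37252, 37252],
    'Avg_Credit_Limit': [100000, 5000, 100000, 17000, 18000],
})

# Rows whose Customer Key appears more than once (keep=False keeps all copies)
dupes = df[df['Customer Key'].duplicated(keep=False)]
print(dupes)

# Keys with more than one entry, without hard-coding them
repeated_keys = df['Customer Key'].value_counts()
repeated_keys = repeated_keys[repeated_keys > 1].index.tolist()
print(repeated_keys)
```

On the real frame the same two lines would replace the manually typed `repeated_cust` tuple.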
Based on the results above, most of the variables are categorical, which tells us that we could group them, for example into 6 or 11 groups.
There are 110 different credit card limits, with a wide range from min to max.
The table above shows that there are outliers for this variable, given the difference between the 75th percentile and the max value, as in Avg_Credit_Limit.
The data needs to be scaled in order to compare the variables, especially Avg_Credit_Limit, which has much bigger numbers than the rest.
pd.options.display.float_format = '{:,.3f}'.format # to see 3 decimals in the output of the cell below
# Now we will get a list of unique values to evaluate how to arrange the data set
for a in list(df.columns):
    n = df[a].unique()
    # if the number of unique values is less than 30, print the values; otherwise print the number of unique values
    if len(n) < 30:
        print(a + ': ')
        print(df[a].value_counts(normalize=True))
        print()
    else:
        print(a + ': ' + str(len(n)) + ' unique values')
        print()
The number of total visits online and visits to the bank with the most repetition is 2 (the mode).
The mode of Calls_made and of the number of credit cards is the same: 4.
Insight 2
# I will scale the data using z-scores in order to:
## - Compare the variables between them, and
## - Be able to apply clustering methods.
interest_df = df.drop(['Customer Key'], axis=1) # dropping the customer key since it will not add any value to the output
from scipy.stats import zscore
interest_df_z = interest_df.apply(zscore)
interest_df_z.head()
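A quick sanity check that the z-scoring behaves as expected (scipy's `zscore` normalises with the population std, ddof=0). A minimal sketch on a hypothetical mini-frame, not the real data:

```python
import numpy as np
import pandas as pd
from scipy.stats import zscore

# Hypothetical mini-frame standing in for interest_df
demo = pd.DataFrame({'Avg_Credit_Limit': [5000.0, 12000.0, 100000.0, 34000.0],
                     'Total_calls_made': [1.0, 4.0, 0.0, 7.0]})
demo_z = demo.apply(zscore)

# After z-scoring, every column should have mean ~0 and population std ~1
assert np.allclose(demo_z.mean(), 0)
assert np.allclose(demo_z.std(ddof=0), 1)
print(demo_z.round(2))
```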
plt.subplots(figsize=(15,10))
ax = sns.boxplot(data=interest_df_z)
ax.set_xticklabels(ax.get_xticklabels(), rotation=45);
interest_df_z.hist(stacked=False, bins=100, figsize=(30,30), layout=(2,3));
# The histogram shows graphically what was seen in the unique values above.
## Please note: I found these commands on the internet; they allow a better and faster visualisation of multiple distplots,
## and were used in the previous project.
#### With this code we can also see the mean of each variable, which is better than the histogram plotted above.
##### This gives me an initial idea of possible groups, but it will be revisited below in the bivariate analysis.
import itertools
import statistics
cols = [i for i in interest_df.columns]
fig = plt.figure(figsize=(20, 25))
for i, j in itertools.zip_longest(cols, range(len(cols))):
    plt.subplot(5, 2, j + 1)
    ax = sns.distplot(df[i], color='blue', rug=True)  # distplot is deprecated in newer seaborn; displot/histplot replace it
    plt.axvline(df[i].mean(), linestyle="dashed", label="Mean", color='black')
    plt.axvline(statistics.mode(df[i]), linestyle="dashed", label="Mode", color='red')
    plt.axvline(statistics.median(df[i]), linestyle="dashed", label="Median", color='green')
    plt.legend()
    plt.title(i)
    plt.xlabel("")
Also, for each variable we can see the combination of at least 2 Gaussians or more; for instance, in total credit cards (with a normal distribution) we can see 4. This means that we could use the number of credit cards to create groups.
interest_df_z.corr() # with this function we will try to see correlations between variables numerically
Avg_Credit_Limit has the highest correlation (>=55%) with Total_Credit_Cards and Total_visits_online, and a 41% (negative) correlation with Total_calls_made.
Total_Credit_Cards has a correlation of 31% with Total_visits_bank and a big negative correlation with Total_calls_made (65%).
Total_visits_bank shows a strong negative correlation (>50%) with Total_calls_made and Total_visits_online.
g = sns.PairGrid(interest_df_z)
g.map_upper(plt.scatter)
g.map_lower(sns.lineplot)
g.map_diag(sns.kdeplot, lw=3, legend=True);
sns.pairplot(interest_df_z, hue='Total_visits_bank', diag_kind='kde')
# using Total_visits_bank because it has the smallest number of unique values among the columns
plt.show()
#Another correlation method (Spearman)
plt.figure(figsize=(10,10))
mask = np.zeros_like(interest_df_z.corr('spearman'))
mask[np.triu_indices_from(mask)] = True
ax = sns.heatmap(interest_df_z.corr('spearman'),  # use the same method as the mask above
                 annot=True,
                 linewidths=.5,
                 center=0,
                 cmap="YlGnBu",
                 mask=mask,
                 )
ax.set_xticklabels(ax.get_xticklabels(), rotation=45);
plt.show()
#pip install pandas-profiling[notebook]
from pandas_profiling import ProfileReport  # note: in newer releases this package has been renamed ydata-profiling
profile = ProfileReport(interest_df_z )
profile
The main takeaway of this step is that in future projects I will apply it at the beginning of the study. It gives an idea of where to look for more detail, like relevant correlations and the more important variables. Most of the observations from this report about the data were mentioned already.
#Finding the optimal no. of clusters using the data frame before scaling it
from scipy.spatial.distance import cdist
clusters=range(1,10)
meanDistortions=[]
for k in clusters:
    model = KMeans(n_clusters=k)
    model.fit(interest_df)
    prediction = model.predict(interest_df)
    meanDistortions.append(sum(np.min(cdist(interest_df, model.cluster_centers_, 'euclidean'), axis=1)) / interest_df.shape[0])
plt.plot(clusters, meanDistortions, 'bx-')
plt.xlabel('k')
plt.ylabel('Average distortion')
plt.title('Selecting k with the Elbow Method')
#Finding the optimal no. of clusters using the scaled dataframe
## This was done to see if there was a big difference in the graphics; in both cases n_clusters=3 is obtained.
from scipy.spatial.distance import cdist
clusters=range(1,10)
meanDistortions=[]
for k in clusters:
    model2 = KMeans(n_clusters=k)
    model2.fit(interest_df_z)
    prediction = model2.predict(interest_df_z)
    meanDistortions.append(sum(np.min(cdist(interest_df_z, model2.cluster_centers_, 'euclidean'), axis=1)) / interest_df_z.shape[0])
plt.plot(clusters, meanDistortions, 'bx-')
plt.xlabel('k')
plt.ylabel('Average distortion')
plt.title('Selecting k with the Elbow Method using scaled dataframe')
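As an aside, KMeans already stores the within-cluster sum of squared distances in its `inertia_` attribute, so an elbow curve can also be built without `cdist`. A sketch on synthetic blobs (not the bank data), with a rough heuristic that picks the k after which the inertia drop flattens:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Three well-separated synthetic blobs standing in for the scaled data
X, _ = make_blobs(n_samples=300, centers=[[-5, -5], [0, 5], [5, -5]],
                  cluster_std=0.8, random_state=0)

inertias = []
ks = range(1, 10)
for k in ks:
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias.append(km.inertia_)  # within-cluster sum of squared distances

# The inertia drop flattens sharply after the true k: compare successive ratios
drops = [inertias[i] / inertias[i + 1] for i in range(len(inertias) - 1)]
elbow = int(np.argmax(drops)) + 2  # +2 because drops[0] is the 1->2 transition
print('elbow at k =', elbow)
```

The ratio heuristic is only a rough stand-in for reading the elbow plot by eye; on real data the visual inspection above remains the safer call.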
#Set the value of k=3
model3 = KMeans(n_clusters=3,n_init = 15, random_state=2345)
model3.fit(interest_df_z)
preds = model3.predict(interest_df_z)
from sklearn.metrics import silhouette_score
labels = model3.labels_
silhouette_score(interest_df_z, labels, metric='euclidean')
#Set the value of k=4
## This step is just to see the silhouette score using 4 clusters; we confirm that 3 is better.
model4 = KMeans(n_clusters=4, n_init = 15, random_state=2345)
model4.fit(interest_df_z)
preds4= model4.predict(interest_df_z)
labels4 = model4.labels_
silhouette_score(interest_df_z, labels4, metric='euclidean')
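Rather than fitting k=3 and k=4 one at a time, the silhouette comparison can be run in a loop over a range of k. A sketch on synthetic, well-separated blobs standing in for the scaled data:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in for interest_df_z: three well-separated blobs
X, _ = make_blobs(n_samples=300, centers=[[-5, -5], [0, 5], [5, -5]],
                  cluster_std=0.8, random_state=2345)

scores = {}
for k in range(2, 7):
    km = KMeans(n_clusters=k, n_init=15, random_state=2345)
    labels = km.fit_predict(X)
    scores[k] = silhouette_score(X, labels, metric='euclidean')

best_k = max(scores, key=scores.get)
print(scores)
print('best k by silhouette:', best_k)
```

On the actual scaled frame the same loop would reproduce the two individual scores computed above and extend them to the rest of the range.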
centroids = model3.cluster_centers_
centroids
#Calculate the centroids for the columns to profile
centroid_df = pd.DataFrame(centroids, columns = list(interest_df_z) )
print(centroid_df)
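The centroids above are in z-score units; to read them in the original units they can be mapped back with each column's mean and population std (scipy's `zscore` uses ddof=0). A sketch with hypothetical stand-in values rather than the real frame:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins: original frame and centroids in z-score units
orig = pd.DataFrame({'Avg_Credit_Limit': [5000.0, 12000.0, 100000.0, 34000.0],
                     'Total_calls_made': [1.0, 4.0, 0.0, 7.0]})
centroids_z = np.array([[0.0, 0.0],
                        [1.0, -0.5]])

# Invert z = (x - mean) / std  ->  x = z * std + mean (population std, ddof=0)
mu = orig.mean().values
sigma = orig.std(ddof=0).values
centroids_orig = pd.DataFrame(centroids_z * sigma + mu, columns=orig.columns)
print(centroids_orig)
```

A centroid of all zeros maps back to the column means, which is a handy sanity check.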
## creating a new dataframe only for labels and converting it into categorical variable
df_labels = pd.DataFrame(model3.labels_ , columns = list(['labels']))
df_labels['labels'] = df_labels['labels'].astype('category')
# Joining the label dataframe with the data frame.
df_labeled = interest_df.join(df_labels)
df_analysis = (df_labeled.groupby(['labels'], axis=0)).head(4177)  # groupby creates a grouped dataframe that needs
# to be converted back to a dataframe.
df_analysis
df_labeled['labels'].value_counts()
http://blog.mahler83.net/2019/10/rotating-3d-t-sne-animated-gif-scatterplot-with-matplotlib/
## This code didn't work as expected; the 3D plot was supposed to rotate.
from mpl_toolkits.mplot3d import axes3d, Axes3D
fig = plt.figure(figsize=(10,10))
ax = Axes3D(fig)
x = interest_df.Total_visits_bank
y = interest_df.Total_visits_online
z = interest_df.Total_calls_made
g = ax.scatter(x, y, z, c=x, marker='o', depthshade=False, cmap='Paired')
ax.set_xlabel('Total Bank Visits')
ax.set_ylabel('Total Visits Online')
ax.set_zlabel('Total Calls Made')
# produce a legend with the unique colors from the scatter
legend = ax.legend(*g.legend_elements(), loc="lower center", title="Total bank visits", borderaxespad=-10, ncol=4)
ax.add_artist(legend)
# plt.show()
from matplotlib import animation
def rotate(angle):
    ax.view_init(azim=angle)
angle = 1
ani = animation.FuncAnimation(fig, rotate, frames=np.arange(0, 360, angle), interval=1)
ani.save('Cluster_plot.gif', writer=animation.PillowWriter(fps=25));
# K = 3
final_model=KMeans(n_clusters=3)
final_model.fit(interest_df_z)
prediction=final_model.predict(interest_df_z)
#Append the prediction
interest_df["GROUP"] = prediction #adding the predictions to the unscaled data
interest_df_z["GROUP"] = prediction #adding the predictions to the scaled data
print("Groups Assigned : \n")
interest_df_z
interest_df.groupby("GROUP").count()
interest_df_z.boxplot(by = 'GROUP', layout=(2,3), figsize=(20, 15))
There is a positive correlation between average credit limit and number of credit cards, and a negative correlation of those with the total calls made; that is to say, the higher the credit limit and the number of credit cards, the fewer calls made to the bank.
Visits to the bank are inverse to the visits online; that is, group 1, which visits the bank the most, is the one that uses the online channel the least. This group seems to prefer more personal contact with the bank.
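Besides the boxplots, a compact numeric profile per group can be produced with a groupby mean plus the cluster sizes. A sketch on a hypothetical labeled frame mimicking interest_df with its GROUP column:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Hypothetical labeled frame standing in for interest_df with GROUP attached
demo = pd.DataFrame({
    'Avg_Credit_Limit': rng.integers(5000, 200000, 90),
    'Total_calls_made': rng.integers(0, 10, 90),
    'GROUP': np.repeat([0, 1, 2], 30),
})

# One row per cluster: the mean of every feature, plus the cluster size
profile = demo.groupby('GROUP').mean()
profile['size'] = demo['GROUP'].value_counts().sort_index()
print(profile)
```

On the real frame this gives the same information as the boxplots in tabular form, which is easier to quote in the conclusions.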
interest_df_z.drop("GROUP", inplace=True, axis=1) # Deleting the column added before
print (interest_df_z.shape)
interest_df_z.head()
#Use ward as the linkage method and Euclidean as the distance metric
#### generate the linkage matrix
from scipy.cluster.hierarchy import dendrogram, linkage
Z = linkage(interest_df_z, 'ward', metric='euclidean')
Z.shape
Z[:]
plt.figure(figsize=(25, 10))
dendrogram(Z)
plt.show()
# Hint: Use truncate_mode='lastp' attribute in dendrogram function to arrive at dendrogram
dendrogram(
    Z,
    truncate_mode='lastp',  # show only the last p merged clusters
    p=3,                    # number of merged clusters to show
)
plt.show()
max_d = 20
from scipy.cluster.hierarchy import fcluster
clusters = fcluster(Z, max_d, criterion='distance')
clusters
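Instead of cutting the dendrogram at a distance threshold like max_d, `fcluster` can also be asked for an exact number of flat clusters with `criterion='maxclust'`. A sketch on synthetic blobs (not the bank data):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.datasets import make_blobs

# Three well-separated synthetic blobs, 50 points each
X, _ = make_blobs(n_samples=150, centers=[[-5, -5], [0, 5], [5, -5]],
                  cluster_std=0.8, random_state=1)

Z = linkage(X, 'ward')
# criterion='maxclust' asks for exactly t flat clusters, no distance threshold needed
labels = fcluster(Z, t=3, criterion='maxclust')
print(np.unique(labels))  # scipy's flat-cluster ids start at 1, not 0
```

This avoids tuning max_d by eye when the target number of groups (here 3) is already decided.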
# Calculate the average silhouette score
#from sklearn.metrics import silhouette_score
silhouette_score(interest_df_z,clusters)
from scipy.cluster.hierarchy import dendrogram, linkage
from scipy.spatial.distance import pdist
import matplotlib.pyplot as plt
plt.figure(figsize=(18, 16))
plt.title('Agglomerative Hierarchical Clustering Dendrogram')
plt.xlabel('sample index')
plt.ylabel('Distance')
Z = linkage(interest_df_z, 'ward')
dendrogram(Z,leaf_rotation=90.0,p=5,color_threshold=30, leaf_font_size=10,truncate_mode='level')
plt.tight_layout()
Since we didn't cover the boxplot for hierarchical clustering in the mentored session, I searched on the internet to find how to create the boxplot, compared it with step 3.2, and arrived at what is shown below.
from sklearn.cluster import AgglomerativeClustering
model5 = AgglomerativeClustering(n_clusters=3, affinity='euclidean', linkage='average')  # in newer scikit-learn, 'affinity' has been replaced by 'metric'
model5.fit(interest_df_z)
prediction2=model5.labels_
interest_df['Group2'] = prediction2 # to differentiate from the GROUP created in the k-means part
interest_df_z['Group2'] = prediction2
interest_df.groupby('Group2').count()
We can see that group #0 has the highest number of clients and represents more than 50% of all the clients in the study.
interest_df_z.boxplot(by='Group2', layout=(2,3), figsize=(20, 15))
This boxplot is very similar to the one observed in Insight 12:
There is a positive correlation between average credit limit and number of credit cards, and a negative correlation of those with the total calls made.
Visits to the bank are inverse to the visits online.
For this exercise, with the parameters used, the result was similar with both techniques. K-means gave a slightly better silhouette score, but overall the relations between the groups were the same.
At the time of this report, I don't know why the order of the groups changed from one technique to the other. That is to say, group #2 in K-means is equivalent to group #1 in hierarchical clustering.
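Cluster ids are arbitrary labels, so the same partition can come out with permuted numbers between runs or algorithms; a contingency table between the two labelings makes the correspondence explicit. A sketch with hypothetical label vectors where the clusters match but the ids are permuted:

```python
import numpy as np
import pandas as pd

# Hypothetical label vectors from the two techniques; the partitions agree
# but the integer ids are permuted (k-means 1 <-> hierarchical 2, etc.)
kmeans_labels = np.array([0, 0, 1, 1, 2, 2, 2, 0, 1, 2])
hier_labels   = np.array([0, 0, 2, 2, 1, 1, 1, 0, 2, 1])

# Rows: k-means ids, columns: hierarchical ids; large counts concentrated in
# one cell per row mean the two partitions agree up to relabeling
print(pd.crosstab(kmeans_labels, hier_labels,
                  rownames=['kmeans'], colnames=['hierarchical']))
```

Applied to the real `GROUP` and `Group2` columns, this table would show directly which k-means group maps to which hierarchical group.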
What I can see up to this point is that the hierarchical clustering technique makes it easier to study the number of groups, and it is also better from the visualization point of view, thanks to the dendrogram.
Given the 3 groups and the differences between them mentioned above, the next step is to assess the impact of each group on the economics of the company, in this case a bank. The biggest group, with more than 50% of the clients, is the most active in visiting the bank and second in number of credit cards, but maybe its spending is not as attractive to the bank as the amounts managed by the smaller group. The marketing team has the option to improve the service on the phone, in the branch with person-to-person service, or on the webpage. Priority should be given to the clients most attractive to the bank.
Without having the full economic profile of the clients, I would suggest starting with the online promotion, since the smaller group, which manages the highest credit limits, is the most active on the online channel.